Data

This TB Hackday script uses (pre-processed) RNA sequencing data from the following study to identify granuloma-associated T cell genes:

  1. Foreman et al. 2023 (J Exp Med) CD30 co-stimulation drives differentiation of protective T cells during Mycobacterium tuberculosis infection

Then the expression of these genes can be further examined and validated using pseudo-bulk transcriptomic data from scRNAseq data in this study:

  1. Bromley et al. 2024 (Immunity) CD4+ T cells re-wire granuloma cellularity and regulatory networks to promote immunomodulation following Mtb reinfection

Background

Foreman et al. compared the gene expression of T cells isolated from PBMC vs. T cells from granulomas (n=4 NHP, n=23 granulomas). They identified genes that were associated with granuloma T cells, and specifically identified genes that were correlated with Mtb burden in the granuloma (CFU).

Foreman et al. 2023 Granuloma-associated T cell genes

Bromley et al. report on an experiment with three groups of cynomolgus macaques: (1) anti-CD4 treated, Mtb-exposed (n=7), (2) IgG control, Mtb-exposed (n=6), (3) No treatment, Mtb-naive (n=6). Groups (2) and (3) were infected with Mtb, then given an anti-CD4 antibody to deplete CD4+ T cells or an isotype control (IgG) antibody, and finally challenged with a secondary Mtb infection. Group 3 only received a primary Mtb infection. Granulomas were then analyzed using single-cell RNAseq, with 3 NHP from (1) and 2 NHP from (2) and (3) each. We have created pseudo-bulk datasets using two different clustering of the cells, offering two levels of cell type granularity for analysis: bromley_X_pseudobulk_counts.csv where X is coarse or subclustering, with clusters defined by the authors of the manuscript. By analyzing data from primary infection, reinfection, and reinfection-CD4+ T cell-depleted granulomas, they found that the presence of CD4+ T cells during reinfection resulted in a less inflammatory lung milieu characterized by reprogrammed CD8+ T cells, reduced neutrophilia, and blunted type 1 immune signaling among myeloid cells.

Bromley et al. 2024 CD4+ T cells re-wire granuloma cellularity and regulatory networks to promote immunomodulation following Mtb reinfection

Hypotheses for hacking

Setup R and load the data.

Load relevant packages. Change <data_dir> variable as appropriate.

Load the pre-processed RNA sequencing data. There are several important files:

  1. foreman_et_al_counts.csv contains log2-transformed counts per million (log2-CPM), that were computed from raw counts by the study authors. The table contains 84 columns, with one column gene_id and the remaining columns matching sampleids in the metadata. There are 30,689 genes in the dataset.

  2. foreman_etal_meta.csv contains all the 83 sample- and granuloma-level metadata that is available for these samples and animals. There are 27 CD8+ T cell samples sorted from granulomas and 31 CD4+ T cell samples sorted from granulomas. Other variables include subject, sort, condition, sex, and granuloma CFU.

  3. bromley_coarse_pseudobulk_counts.csv is a long-form table containing counts for 33 granulomas from 7 NHP, with over 25K genes from 15 coarse clusters of cells from the granulomas (44 sub-clusters). There are columns for biosample_id, CoarseClustering, counts and gene. Counts are not log-normalized. Pseudo-bulk counts were created by summing the counts across all cells in a cluster for each sample.

  4. bromley_coarse_pseudobulk_counts.meta.csv contains meta-data that can be joined to the counts data on biosample_id. Variables include donor_id, Group, CFU (for the granuloma), and CoarseClustering/SubclusteringV2.

library(tidyverse)   # For data manipulation and plotting
library(stats)       # For basic statistical tests
library(multcomp)    # For multiple comparison corrections
library(lme4)        # For mixed-effects modeling
library(broom)       # For tidying model outputs
library(data.table)  # Efficient data handling
library(edgeR)       # RNAseq data processing
library(limma)       # RNAseq linear models
library(ggrepel)     # Label points in scatterplot
library(kimma)       # RNAseq linear mixed effects models
library(BIGpicture)  # RNAseq plots
select <- dplyr::select

# NOTE --- REPLACE the <data_dir> FOLDER DESTINTATION AS APPROPRIATE
# data_dir <- '/home/processed_data'
data_dir <- '/fh/fast/gilbert_p/fg_data/SEATRAC/TB_hackday_2024/processed_data'
# data_dir <- "data/"

# These are already log2-CPM
ncts <- readr::read_csv(file.path(data_dir, "foreman_etal_counts.csv"))
meta <- readr::read_csv(file.path(data_dir, "foreman_etal_meta.csv"))

meta <- meta %>%
  mutate(logCFU = log10(CFU + 1))

Identify genes that are associated with granuloma T cells

This analysis should match part of the analysis presented in Foremant et al. Figure 1. It’s a bit of a strange statistical test because it compares gene expression in each granuloma T cell sort (n=31/27 for CD4/CD8) to gene expression in each PBMC sample (n=4), but we can use a rank-based test to find the genes that are differentially expressed in the two tissues. These will become candidate genes for downstream analysis, so statistical significance is not critical.

Steps for analysis: 1. Filter genes, keeping those that have at least 1 log-normalized count in 60% of the samples. 2. Filter to just the CD4 granuloma and PBMC samples (could alternatively focus on CD8) 3. Prepare the data as a matrix for computationally efficient testing 4. Apply Mann-Whitney test across all genes and collect the results 5. Print the top 20 results with an estimated FDR q-value

# Filter genes for analysis
# ------------------------------------
ltpm <- ncts %>%
  column_to_rownames("gene_id") %>%
  filter(rowMeans(. > 1) > 0.6)

genes_filter <- rownames(ltpm)

# Merge metadata and log-transformed counts
meta_ss <- meta %>%
  filter(condition %in% c("CD4_gran", "CD4_PBMC")) %>%
  column_to_rownames("sampleid")

# Ensure `ltpm` only includes columns for the filtered samples
ltpm_ss <- ltpm %>%
  select(all_of(rownames(meta_ss)))

mdf <- meta_ss %>%
  rownames_to_column("sampleid") %>%
  bind_cols(t(ltpm_ss))  # Ensure samples (columns of ltpm) match rows of meta

# Mann-Whitney U Test for Gene Associations
# ----------------------------------------------------

# Convert mdf to a matrix for gene expression data
expression_data <- as.matrix(mdf %>% select(all_of(genes_filter)))
conditions <- mdf$condition

# Precompute the grouping indices for faster sub-setting
gran_idx <- which(conditions == "CD4_gran")
pbmc_idx <- which(conditions == "CD4_PBMC")

# Vectorized function to compute statistics for all genes
collect_res_matrix <- function(expression_data, gran_idx, pbmc_idx, genes) {
      gran_expr <- expression_data[gran_idx, , drop = FALSE]
      pbmc_expr <- expression_data[pbmc_idx, , drop = FALSE]
      
      # Perform Wilcoxon tests in bulk
      pvalues <- apply(expression_data, 2, function(gene_expr) {
        wilcox.test(gene_expr[gran_idx], gene_expr[pbmc_idx])$p.value
      })
      
      # Calculate mean differences
      mean_gran <- colMeans(gran_expr, na.rm = TRUE)
      mean_pbmc <- colMeans(pbmc_expr, na.rm = TRUE)
      mean_diff <- mean_gran - mean_pbmc
      
      # Assign "GRAN" or "PBMC" based on the mean difference
      assoc <- ifelse(mean_diff > 0, "GRAN", "PBMC")
      
      # Combine results into a data frame
      results <- data.frame(
        gene = genes,
        pvalue = pvalues,
        assoc = assoc,
        stringsAsFactors = FALSE
      )
      
      return(results)
    }

# Run the matrix-optimized function
results_df <- collect_res_matrix(expression_data, gran_idx, pbmc_idx, genes_filter)

# Adjust p-values for multiple comparisons
results_df <- results_df %>%
  mutate(FDRq = p.adjust(pvalue, method = "fdr")) %>%
  arrange(pvalue)

# Other genes of interest from the manuscript
cd4_genes <- c("KLRB1", "CD40LG", "S100A11", "S100A4", "IL26", "BATF")
cd8_genes <- c("APOBEC3G", "IFNG", "TNF", "CCL1", "CCL20")

gran_genes <- results_df %>%
  filter(FDRq < 0.1) %>%
  filter(assoc == 'GRAN') %>%
  pull(gene)

print(head(results_df, 20))
##                      gene      pvalue assoc        FDRq
## APOBEC3H         APOBEC3H 3.81971e-05  GRAN 0.007986249
## CD2                   CD2 3.81971e-05  GRAN 0.007986249
## CHMP4A             CHMP4A 3.81971e-05  GRAN 0.007986249
## CXCR6               CXCR6 3.81971e-05  GRAN 0.007986249
## DNAJB1             DNAJB1 3.81971e-05  GRAN 0.007986249
## EEF1G               EEF1G 3.81971e-05  PBMC 0.007986249
## FAM65B             FAM65B 3.81971e-05  PBMC 0.007986249
## FTL                   FTL 3.81971e-05  PBMC 0.007986249
## GAPDH               GAPDH 3.81971e-05  GRAN 0.007986249
## GIMAP7             GIMAP7 3.81971e-05  GRAN 0.007986249
## IL2RA               IL2RA 3.81971e-05  GRAN 0.007986249
## ITGAE               ITGAE 3.81971e-05  GRAN 0.007986249
## LOC100423954 LOC100423954 3.81971e-05  GRAN 0.007986249
## LOC100425072 LOC100425072 3.81971e-05  PBMC 0.007986249
## LOC100426537 LOC100426537 3.81971e-05  GRAN 0.007986249
## LOC106993126 LOC106993126 3.81971e-05  GRAN 0.007986249
## LOC694538       LOC694538 3.81971e-05  GRAN 0.007986249
## LOC695218       LOC695218 3.81971e-05  PBMC 0.007986249
## LOC702809       LOC702809 3.81971e-05  PBMC 0.007986249
## LOC703774       LOC703774 3.81971e-05  PBMC 0.007986249

Fit the model to identify genes associated with protection (lower CFU)

Now we will test the granuloma-associated genes to see if they are associated with protection.

The sampleid columns of the ncts variable and the rows of sampleid in the meta variable match. For this first analysis we will focus on the CD8 T cells sorted from PBMC or granulomas, creating subset tables indicated by _ss variable. Then we initialize the DGEList object and create a limma-voom model with a design matrix to identify genes that are associated with granuloma Mtb burden (CFU).

In the accompanying “mean-variance” plot the x-axis represents the average expression levels of genes across all samples. The y-axis represents the square-root of the variance (i.e., standard deviation) of gene expression levels. It shows how the variance changes with respect to the mean expression. Every dot is a gene and the trend line shows the relationship between the mean and the variance. Note that the variance is relatively stable across expression levels and the relationship is smooth; this is good for analysis and voom will use this relationship to adjust the model fits of each gene. If you re-run the code block without the filtering you will see the impact on the mean-variance plot.

meta_ss = meta %>% filter(condition == "CD4_gran")
keep_ids = meta_ss %>% pull(sampleid)
keep_ids = c('gene_id', keep_ids)

ncts_ss = ncts %>% dplyr::select(any_of(keep_ids))

# Keep only the genes from 
ncts_ss <- ncts_ss %>%
        filter(gene_id %in% gran_genes)

# Move gene ID to rownames
ncts_ss_mat <- as.matrix(ncts_ss[,-1])
rownames(ncts_ss_mat) = ncts_ss$gene_id

# FOR TESTING ALL GENES: alternatively, discard genes that have low counts/prevalence
# filter = rowSums(ncts_ss > 1) >= (0.5 * ncol(ncts_ss))
# ncts_ss = ncts_ss[filter, ]

# Create the object for differential expression testing
dge_o = DGEList(counts=ncts_ss_mat,
                genes=rownames(ncts_ss_mat),
                samples=meta_ss,
                group=meta_ss[['logCFU']])

# Specify the model/design matrix
design_temp = model.matrix(~logCFU, data=dge_o$samples)

# Create the voom object and fit the model
v <- voomWithQualityWeights(dge_o, design=design_temp, plot=TRUE)

#Fit model
fit = lmFit(v, design_temp)

# Estimate contrasts and p-values
fit = eBayes(fit, robust=TRUE)

summary(decideTests(fit, adjust.method="fdr", p.value = 0.2))
##        (Intercept) logCFU
## Down             0     33
## NotSig           0    240
## Up             298     25
results <- topTable(fit, adjust="BH", coef="logCFU", p.value=1, number=Inf, resort.by="P")

head(results %>% dplyr::select(genes, logFC, AveExpr, P.Value, adj.P.Val), 20)
##                     genes       logFC  AveExpr      P.Value  adj.P.Val
## SYTL2               SYTL2  0.11492280 11.41220 0.0004588508 0.07600388
## ASH1L               ASH1L  0.11836029 11.51505 0.0006027642 0.07600388
## KLRB1               KLRB1 -0.18640544 11.59983 0.0007651397 0.07600388
## PSMA2               PSMA2 -0.04935484 12.30006 0.0024829731 0.10056038
## CHD1                 CHD1  0.08975339 11.45801 0.0026723083 0.10056038
## LOC100426632 LOC100426632 -0.08040552 12.18295 0.0030091584 0.10056038
## LOC100430627 LOC100430627 -0.06177910 12.35178 0.0035021669 0.10056038
## FOSB                 FOSB  0.15424175 11.27775 0.0038699198 0.10056038
## PRDX1_1           PRDX1_1 -0.04523569 12.13187 0.0039487373 0.10056038
## TSPAN5             TSPAN5  0.10415933 11.48621 0.0047585952 0.10056038
## PHF20               PHF20  0.07334830 11.60675 0.0050061585 0.10056038
## RRBP1               RRBP1  0.16407615 11.17417 0.0051004715 0.10056038
## LOC100423954 LOC100423954 -0.10162656 12.24038 0.0054423765 0.10056038
## DNAJB1             DNAJB1  0.08412195 12.36983 0.0058819762 0.10056038
## LOC721878       LOC721878 -0.08548157 11.63150 0.0061394561 0.10056038
## RARRES3           RARRES3 -0.04687187 12.35166 0.0063386579 0.10056038
## MYL6                 MYL6 -0.02998800 12.55174 0.0065410830 0.10056038
## TRAF1               TRAF1  0.10035311 11.61048 0.0066815712 0.10056038
## RNF4                 RNF4  0.06467035 11.81728 0.0070282116 0.10056038
## SERPINB9         SERPINB9  0.05040778 11.87255 0.0073799232 0.10056038

Create a volcano plot for single-gene association with protection

# Add a column for significance based on FDR
results <- results %>%
  mutate(Significance = ifelse(adj.P.Val < 0.2, "Significant", "Not Significant"))

# Select the top 10 genes based on adjusted p-value for labeling
top_genes <- results %>%
  arrange(adj.P.Val) %>%
  slice_head(n = 10)

max_logFC <- max(abs(results$logFC), na.rm = TRUE)

# Create the volcano plot
volcano_plot <- ggplot(results, aes(x = logFC, y = -log10(P.Value))) +
  geom_point(aes(color = Significance), alpha = 0.6) +
  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "grey")) +
  geom_text_repel(data = top_genes,
                  aes(label = genes),
                  max.overlaps = 10,
                  box.padding = 0.3,
                  point.padding = 0.3,
                  segment.color = "grey50",
                  size = 3) +
  xlim(c(-max_logFC, max_logFC)) +
  theme_minimal() +
  labs(
    x = "log2 Estimate",
    y = "-log10 P-value",
    color = "FDR < 0.2") +
  theme(plot.title = element_text(hjust = 0.5))

volcano_plot

Redo the analysis using a mixed-effects model to incorporate repeated measures

Using kimma, perform the same analysis for CFU only now taking into account the within animal variability with multiple granulomas samples from the same animal. We will still focus on CD4+ cells from granulomas.

# Use the same voom object as with limma

klm <- kmFit(dat = v,
             model = "~logCFU + (1|subject)",
             run_lme = TRUE,
             libraryID="sampleid",
             patientID="subject",
             use_weights = TRUE,
             metrics = FALSE,
             processors=1)
## lme/lmerel model: expression~logCFU+(1|subject)
## Input: 31 libraries from 4 unique patients
## Model: 31 libraries
## Complete: 298 genes
## Failed: 0 genes
summarise_kmFit(fdr = klm$lme)
## # A tibble: 3 × 7
##   variable          `fdr<0.05` `fdr<0.1` `fdr<0.2` `fdr<0.3` `fdr<0.4` `fdr<0.5`
##   <fct>                  <int>     <int>     <int>     <int>     <int>     <int>
## 1 (1 | subject)             30        36        44        53        57        67
## 2 logCFU                     0        29        63        77       117       138
## 3 total (nonredund…         30        65       101       119       154       178
plot_volcano(model_result = klm,
             model = "lme", variables = "logCFU",
             y_cutoff = 0.2, label = 10)

Visulize results for a single gene

# Scatter-plot of Gene Expression vs. log-CFU
gene <- "TNFRSF4"

ggplot(mdf, aes(x = logCFU, y = get(gene))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = paste("Correlation of", gene, "with logCFU"),
    x = "logCFU",
    y = paste("Expression of", gene)
  ) +
  theme_minimal()
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Boxplot of Gene Expression by Condition
ggplot(mdf, aes(x = condition, y = get(gene))) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.7) +
  labs(
    title = paste("Expression of", gene, "by Condition"),
    x = "Condition",
    y = paste("Expression of", gene)
  ) +
  theme_minimal()

Use the Bromley et al. data to identify cells that express protective genes and validate their association with protection

Load the pseudo-bulk data from Bromley et al. (2024). Note above that it can be loaded with two different granularity of clusters: coarse vs. subclusters. Below we just do a simple rank correlation of gene expression (CPM) with log-CFU (across granulomas) to see which of the genes from the above analysis correlate with protection in these experiments, and in which cell cluster.

# Load Bromley data
brom <- read_csv(file.path(data_dir, "bromley_coarse_pseudobulk_counts.csv"))
## New names:
## Rows: 11603648 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): biosample_id, CoarseClustering, gene dbl (2): ...1, counts
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
bmeta <- read_csv(file.path(data_dir, "bromley_coarse_pseudobulk_counts.meta.csv"))
## New names:
## Rows: 1251 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (12): ...1, biosample_id, donor_id, array number, Sample Name, Sample ty... dbl
## (2): Infusion before 2nd Mtb infection anti CD4 or IgG, CFU Total
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Rename columns for consistency
bmeta <- bmeta %>%
  rename(CFU = `CFU Total`) %>%
  mutate(logCFU = log10(CFU + 1))

# Visualize distribution of logCFU
ggplot(bmeta, aes(x = logCFU)) +
  geom_density(fill = "blue", alpha = 0.3) +
  labs(title = "Distribution of logCFU", x = "logCFU", y = "Density") +
  theme_minimal()

# Define function to preprocess gene-specific data
prepare_ss <- function(gene) {
  ss <- brom %>% filter(gene == !!gene)  # Subset for the specific gene
  tot <- ss %>%
    group_by(biosample_id) %>%
    summarize(tot = sum(counts), .groups = "drop")
  
  ss <- ss %>%
    left_join(tot, by = "biosample_id") %>%
    left_join(bmeta, by = c("biosample_id", "CoarseClustering")) %>%
    mutate(
      lcpm = log2((counts + 0.01) / tot),
      cpm = counts / tot
    )
  
  return(ss)
}

# Perform Spearman correlation for CD4 and CD8 genes
cd4_genes <- c("TNFRSF4", "KLRB1", "CD40LG", "S100A11", "S100A4", "IL26", "BATF")
cd8_genes <- c("APOBEC3G", "IFNG", "TNF", "CCL1", "CCL20")

gene_list <- c(cd4_genes, cd8_genes)

res <- lapply(gene_list, function(g) {
  ss <- prepare_ss(g)
  unique_clusters <- unique(ss$CoarseClustering)
  
  cluster_res <- lapply(unique_clusters, function(clust) {
    tmp <- ss %>% filter(CoarseClustering == clust)
    
    # Calculate Spearman correlation
    cor_res <- cor.test(tmp$cpm, tmp$logCFU, method = "spearman")
    data.frame(
      gene = g,
      cluster = clust,
      n = nrow(tmp),
      rho = cor_res$estimate,
      pvalue = cor_res$p.value
    )
  })
  
  do.call(rbind, cluster_res)
})

# Combine results into a single data frame
res_df <- do.call(rbind, res) %>%
  arrange(pvalue) %>%
  mutate(FDRq = p.adjust(pvalue, method = "fdr"))

print(res_df)
##             gene               cluster   n          rho       pvalue
## rho99        TNF            T,NK cells 348 -0.619029412 4.357425e-38
## rho95       IL26            T,NK cells 348  0.545054217 3.049259e-28
## rho93    S100A11            T,NK cells 348 -0.437176233 1.121929e-17
## rho97   APOBEC3G            T,NK cells 348 -0.347051798 2.757506e-11
## rho1311    CCL20 Alveolar Type 2 cells  79  0.623885558 8.130907e-10
## rho69        TNF           Neutrophils  61  0.635839221 3.671837e-08
## rho14    TNFRSF4           Fibroblasts  98  0.493176174 2.479051e-07
## rho411     CCL20     Endothelial cells 166  0.387103582 2.575956e-07
## rho42     CD40LG     Endothelial cells 166  0.369715775 9.472664e-07
## rho67   APOBEC3G           Neutrophils  61  0.541112026 6.714981e-06
## rho81      KLRB1           Macrophages 185 -0.320543085 9.162651e-06
## rho49        TNF     Endothelial cells 166  0.335576469 9.880245e-06
## rho33    S100A11            Mast cells  84 -0.441355616 2.641337e-05
## rho124    S100A4                  pDCs  31  0.678032014 2.776026e-05
## rho34     S100A4            Mast cells  84 -0.429173719 4.631410e-05
## rho83    S100A11           Macrophages 185  0.292789558 5.246485e-05
## rho66       BATF           Neutrophils  61  0.488820089 6.411343e-05
## rho132    CD40LG Alveolar Type 2 cells  79  0.432988366 6.732569e-05
## rho137  APOBEC3G Alveolar Type 2 cells  79  0.410070803 1.746068e-04
## rho37   APOBEC3G            Mast cells  84 -0.396256246 1.904979e-04
## rho35       IL26            Mast cells  84 -0.394911153 2.202207e-04
## rho147  APOBEC3G           Fibroblasts  98  0.364119481 2.280592e-04
## rho104    S100A4                 cDC 1  32  0.588558895 3.952432e-04
## rho113   S100A11                 cDC 2  32  0.597507331 3.963332e-04
## rho94     S100A4            T,NK cells 348  0.182228433 6.358543e-04
## rho87   APOBEC3G           Macrophages 185  0.245077236 7.731254e-04
## rho142    CD40LG           Fibroblasts  98  0.334016379 7.759048e-04
## rho1310     CCL1 Alveolar Type 2 cells  79  0.364754670 1.197215e-03
## rho4     TNFRSF4     Endothelial cells 166  0.247966873 1.276784e-03
## rho96       BATF            T,NK cells 348 -0.169069070 1.548633e-03
## rho13    TNFRSF4 Alveolar Type 2 cells  79  0.349917826 1.571533e-03
## rho92     CD40LG            T,NK cells 348 -0.168004899 1.685766e-03
## rho10    TNFRSF4                 cDC 1  32  0.527465166 1.920709e-03
## rho1210      TNF                  pDCs  31  0.521775839 2.608582e-03
## rho133   S100A11 Alveolar Type 2 cells  79  0.334149819 2.615340e-03
## rho36       BATF            Mast cells  84 -0.317813745 3.219631e-03
## rho9     TNFRSF4            T,NK cells 348 -0.157131746 3.338380e-03
## rho84     S100A4           Macrophages 185 -0.214634308 3.348127e-03
## rho1411    CCL20           Fibroblasts  98  0.293193284 3.390691e-03
## rho3     TNFRSF4            Mast cells  84 -0.314944618 3.732179e-03
## rho88       IFNG           Macrophages 185 -0.211097666 3.922178e-03
## rho145      IL26           Fibroblasts  98 -0.283053101 4.740934e-03
## rho143   S100A11           Fibroblasts  98  0.278794650 5.438269e-03
## rho148      IFNG           Fibroblasts  98  0.278171967 5.547530e-03
## rho410      CCL1     Endothelial cells 166 -0.214360931 6.159689e-03
## rho131     KLRB1 Alveolar Type 2 cells  79  0.303707969 6.508873e-03
## rho46       BATF     Endothelial cells 166  0.208168374 7.116471e-03
## rho144    S100A4           Fibroblasts  98  0.266285121 8.042055e-03
## rho107  APOBEC3G                 cDC 1  32  0.461143695 8.484726e-03
## rho106      BATF                 cDC 1  32  0.454962584 8.889571e-03
## rho118      BATF               B cells  32 -0.453445748 9.767368e-03
## rho138      IFNG Alveolar Type 2 cells  79  0.289064485 9.774134e-03
## rho27       BATF Alveolar Type 1 cells  28  0.469311200 1.175250e-02
## rho135      IL26 Alveolar Type 2 cells  79  0.275283431 1.407222e-02
## rho146      BATF           Fibroblasts  98  0.245590285 1.479002e-02
## rho1110 APOBEC3G                 cDC 2  32  0.428885630 1.501990e-02
## rho98       IFNG            T,NK cells 348 -0.127738396 1.711956e-02
## rho86       BATF           Macrophages 185 -0.174481970 1.753106e-02
## rho910      CCL1            T,NK cells 348  0.125658451 2.122753e-02
## rho119      BATF                 cDC 2  32  0.403962746 2.185115e-02
## rho103   S100A11                 cDC 1  32  0.395161290 2.595561e-02
## rho126      BATF                  pDCs  31  0.388141080 3.095359e-02
## rho127  APOBEC3G                  pDCs  31  0.387337435 3.133313e-02
## rho89        TNF           Macrophages 185 -0.155955130 3.451405e-02
## rho130       TNF               B cells  32 -0.373670726 3.514415e-02
## rho136      BATF Alveolar Type 2 cells  79  0.231759664 3.986321e-02
## rho8     TNFRSF4           Macrophages 185  0.150409510 4.155409e-02
## rho139       TNF Alveolar Type 2 cells  79  0.229269765 4.210581e-02
## rho102    CD40LG                 cDC 1  32  0.361165813 4.226439e-02
## rho121     KLRB1                  pDCs  31  0.355726509 4.953151e-02
## rho810      CCL1           Macrophages 185 -0.146957887 5.028982e-02
## rho110   S100A11               B cells  32 -0.348240469 5.143618e-02
## rho1212    CCL20                  pDCs  31  0.347807732 5.520287e-02
## rho149       TNF           Fibroblasts  98  0.188905939 6.247782e-02
## rho64     S100A4           Neutrophils  61  0.235993881 6.710110e-02
## rho101     KLRB1                 cDC 1  32  0.321917368 7.237372e-02
## rho6     TNFRSF4           Neutrophils  61  0.231361895 7.280692e-02
## rho1010     CCL1                 cDC 1  32  0.323753733 7.561581e-02
## rho123   S100A11                  pDCs  31  0.318145161 8.150861e-02
## rho7     TNFRSF4          Plasma cells  31  0.316834483 8.245322e-02
## rho91      KLRB1            T,NK cells 348  0.092370362 8.577191e-02
## rho710      CCL1          Plasma cells  31 -0.317411102 8.741815e-02
## rho47   APOBEC3G     Endothelial cells 166 -0.132876041 8.789127e-02
## rho31      KLRB1            Mast cells  84 -0.185048387 9.397315e-02
## rho63    S100A11           Neutrophils  61  0.216162578 9.428733e-02
## rho22     CD40LG        Ciliated cells  17  0.408248290 1.037709e-01
## rho1011    CCL20                 cDC 1  32  0.291706448 1.052446e-01
## rho134    S100A4 Alveolar Type 2 cells  79  0.178182350 1.161648e-01
## rho122    CD40LG                  pDCs  31  0.284856107 1.203719e-01
## rho54     S100A4           Eosinophils  27 -0.305430069 1.213212e-01
## rho48       IFNG     Endothelial cells 166  0.120416171 1.222594e-01
## rho53    S100A11           Eosinophils  27 -0.304639805 1.223867e-01
## rho12    TNFRSF4                  pDCs  31  0.275534794 1.335409e-01
## rho75       IL26          Plasma cells  31 -0.265361389 1.490859e-01
## rho115    S100A4                 cDC 2  32  0.256231672 1.565194e-01
## rho17     CD40LG Alveolar Type 1 cells  28  0.267752418 1.683511e-01
## rho30       IFNG Alveolar Type 1 cells  28  0.267752418 1.683511e-01
## rho85       IL26           Macrophages 185 -0.099985119 1.768852e-01
## rho15      KLRB1 Alveolar Type 1 cells  28 -0.259971697 1.815396e-01
## rho      TNFRSF4 Alveolar Type 1 cells  28  0.258144262 1.847389e-01
## rho1410     CCL1           Fibroblasts  98 -0.137277155 1.870356e-01
## rho76       BATF          Plasma cells  31  0.237929057 1.974338e-01
## rho150     CCL20               B cells  32  0.232610000 2.001418e-01
## rho11    TNFRSF4                 cDC 2  32  0.230838380 2.036874e-01
## rho610      CCL1           Neutrophils  61  0.165968326 2.090128e-01
## rho125      IL26                  pDCs  31  0.230971146 2.112511e-01
## rho141     KLRB1           Fibroblasts  98 -0.120579635 2.369349e-01
## rho43    S100A11     Endothelial cells 166  0.092134761 2.377551e-01
## rho129      IFNG                  pDCs  31  0.215565922 2.441592e-01
## rho112    CD40LG                 cDC 2  32  0.211126824 2.460776e-01
## rho311      CCL1            Mast cells  84 -0.128549316 2.467858e-01
## rho52     CD40LG           Eosinophils  27  0.227128381 2.545711e-01
## rho44     S100A4     Endothelial cells 166  0.086444540 2.681131e-01
## rho41      KLRB1     Endothelial cells 166  0.085832805 2.715272e-01
## rho39        TNF Alveolar Type 1 cells  28  0.214460072 2.731269e-01
## rho109       TNF                 cDC 1  32  0.197576536 2.783987e-01
## rho114    S100A4               B cells  32 -0.196480938 2.798931e-01
## rho1114    CCL20                 cDC 2  32  0.189790134 2.981568e-01
## rho611     CCL20           Neutrophils  61  0.131505064 3.123828e-01
## rho911     CCL20            T,NK cells 348 -0.053358059 3.216513e-01
## rho1     TNFRSF4               B cells  32  0.178812645 3.274838e-01
## rho108      IFNG                 cDC 1  32  0.176823479 3.329817e-01
## rho23    S100A11        Ciliated cells  17  0.233650087 3.667548e-01
## rho57   APOBEC3G           Eosinophils  27  0.180014364 3.689232e-01
## rho140      CCL1               B cells  32 -0.163299316 3.800729e-01
## rho1112      TNF                 cDC 2  32  0.158503094 3.862424e-01
## rho120  APOBEC3G               B cells  32  0.155791789 3.929650e-01
## rho210  APOBEC3G        Ciliated cells  17 -0.215091402 4.070683e-01
## rho811     CCL20           Macrophages 185 -0.060011666 4.183886e-01
## rho50      CCL20 Alveolar Type 1 cells  28  0.154906274 4.312326e-01
## rho68       IFNG           Neutrophils  61 -0.101723839 4.353444e-01
## rho73    S100A11          Plasma cells  31 -0.140665946 4.503841e-01
## rho20     S100A4 Alveolar Type 1 cells  28  0.148189362 4.517123e-01
## rho16      KLRB1               B cells  32 -0.136979577 4.547141e-01
## rho79        TNF          Plasma cells  31  0.130578295 4.838279e-01
## rho56       BATF           Eosinophils  27 -0.139949732 4.862859e-01
## rho1211     CCL1                  pDCs  31 -0.114811440 5.457568e-01
## rho214     CCL20        Ciliated cells  17  0.151486360 5.616548e-01
## rho711     CCL20          Plasma cells  31  0.105993461 5.703762e-01
## rho61      KLRB1           Neutrophils  61  0.074016215 5.707792e-01
## rho74     S100A4          Plasma cells  31  0.101352538 5.874620e-01
## rho19    S100A11 Alveolar Type 1 cells  28 -0.105927951 5.916296e-01
## rho28       BATF        Ciliated cells  17  0.139895844 5.922933e-01
## rho511     CCL20           Eosinophils  27 -0.103605389 6.070682e-01
## rho51      KLRB1           Eosinophils  27 -0.101330998 6.150245e-01
## rho38       IFNG            Mast cells  84  0.054029952 6.254581e-01
## rho59        TNF           Eosinophils  27  0.096019877 6.337692e-01
## rho1111     IFNG                 cDC 2  32  0.082539708 6.533551e-01
## rho58       IFNG           Eosinophils  27 -0.087357070 6.648174e-01
## rho82     CD40LG           Macrophages 185 -0.030964460 6.764882e-01
## rho26       IL26        Ciliated cells  17  0.102062073 6.966992e-01
## rho116      IL26               B cells  32 -0.068083207 7.112008e-01
## rho105      IL26                 cDC 1  32 -0.068083207 7.112008e-01
## rho18     CD40LG               B cells  32  0.064950514 7.239603e-01
## rho71      KLRB1          Plasma cells  31  0.061648897 7.418121e-01
## rho212       TNF        Ciliated cells  17  0.084753795 7.463818e-01
## rho32     CD40LG            Mast cells  84 -0.035855065 7.476006e-01
## rho72     CD40LG          Plasma cells  31 -0.056971221 7.608122e-01
## rho510      CCL1           Eosinophils  27  0.057867175 7.743430e-01
## rho24     S100A4        Ciliated cells  17  0.071345695 7.855408e-01
## rho78       IFNG          Plasma cells  31  0.041650535 8.239504e-01
## rho310       TNF            Mast cells  84  0.021566239 8.465513e-01
## rho1113     CCL1                 cDC 2  32 -0.031109769 8.680516e-01
## rho312     CCL20            Mast cells  84 -0.015969259 8.860615e-01
## rho29   APOBEC3G Alveolar Type 1 cells  28  0.026705947 8.926945e-01
## rho111     KLRB1                 cDC 2  32 -0.023492377 8.984473e-01
## rho128      IFNG               B cells  32  0.013993298 9.394094e-01
## rho77   APOBEC3G          Plasma cells  31 -0.009880028 9.579309e-01
## rho62     CD40LG           Neutrophils  61 -0.003575747 9.781807e-01
## rho21      KLRB1        Ciliated cells  17  0.006586363 9.799849e-01
## rho5     TNFRSF4           Eosinophils  27  0.000000000 1.000000e+00
## rho2     TNFRSF4        Ciliated cells  17           NA           NA
## rho25       IL26 Alveolar Type 1 cells  28           NA           NA
## rho45       IL26     Endothelial cells 166           NA           NA
## rho55       IL26           Eosinophils  27           NA           NA
## rho65       IL26           Neutrophils  61           NA           NA
## rho117      IL26                 cDC 2  32           NA           NA
## rho211      IFNG        Ciliated cells  17           NA           NA
## rho40       CCL1 Alveolar Type 1 cells  28           NA           NA
## rho213      CCL1        Ciliated cells  17           NA           NA
##                 FDRq
## rho99   7.451197e-36
## rho95   2.607116e-26
## rho93   6.394996e-16
## rho97   1.178834e-09
## rho1311 2.780770e-08
## rho69   1.046474e-06
## rho14   5.506106e-06
## rho411  5.506106e-06
## rho42   1.799806e-05
## rho67   1.148262e-04
## rho81   1.407935e-04
## rho49   1.407935e-04
## rho33   3.390718e-04
## rho124  3.390718e-04
## rho34   5.279807e-04
## rho83   5.607180e-04
## rho66   6.395940e-04
## rho132  6.395940e-04
## rho137  1.571462e-03
## rho37   1.628757e-03
## rho35   1.772642e-03
## rho147  1.772642e-03
## rho104  2.823874e-03
## rho113  2.823874e-03
## rho94   4.349244e-03
## rho87   4.914064e-03
## rho142  4.914064e-03
## rho1310 7.311565e-03
## rho4    7.528621e-03
## rho96   8.668779e-03
## rho13   8.668779e-03
## rho92   9.008311e-03
## rho10   9.952764e-03
## rho1210 1.277780e-02
## rho133  1.277780e-02
## rho36   1.486687e-02
## rho9    1.486687e-02
## rho84   1.486687e-02
## rho1411 1.486687e-02
## rho3    1.595507e-02
## rho88   1.635835e-02
## rho145  1.930237e-02
## rho143  2.155972e-02
## rho148  2.155972e-02
## rho410  2.340682e-02
## rho131  2.419603e-02
## rho46   2.589184e-02
## rho144  2.864982e-02
## rho107  2.960996e-02
## rho106  3.040233e-02
## rho118  3.214186e-02
## rho138  3.214186e-02
## rho27   3.791843e-02
## rho135  4.456203e-02
## rho146  4.586433e-02
## rho1110 4.586433e-02
## rho98   5.135868e-02
## rho86   5.168639e-02
## rho910  6.152386e-02
## rho119  6.227577e-02
## rho103  7.276082e-02
## rho126  8.504707e-02
## rho127  8.504707e-02
## rho89   9.221722e-02
## rho130  9.245614e-02
## rho136  1.032820e-01
## rho8    1.047422e-01
## rho139  1.047422e-01
## rho102  1.047422e-01
## rho121  1.209984e-01
## rho810  1.211206e-01
## rho110  1.221609e-01
## rho1212 1.293108e-01
## rho149  1.443744e-01
## rho64   1.529905e-01
## rho101  1.616881e-01
## rho6    1.616881e-01
## rho1010 1.657731e-01
## rho123  1.762437e-01
## rho7    1.762437e-01
## rho91   1.810740e-01
## rho710  1.810772e-01
## rho47   1.810772e-01
## rho31   1.896839e-01
## rho63   1.896839e-01
## rho22   2.063352e-01
## rho1011 2.068601e-01
## rho134  2.257293e-01
## rho122  2.274796e-01
## rho54   2.274796e-01
## rho48   2.274796e-01
## rho53   2.274796e-01
## rho12   2.455429e-01
## rho75   2.712095e-01
## rho115  2.817349e-01
## rho17   2.967838e-01
## rho30   2.967838e-01
## rho85   3.086466e-01
## rho15   3.135684e-01
## rho     3.159035e-01
## rho1410 3.166643e-01
## rho76   3.309920e-01
## rho150  3.322743e-01
## rho11   3.349091e-01
## rho610  3.403923e-01
## rho125  3.407918e-01
## rho141  3.764456e-01
## rho43   3.764456e-01
## rho129  3.801836e-01
## rho112  3.801836e-01
## rho311  3.801836e-01
## rho52   3.886755e-01
## rho44   4.057286e-01
## rho41   4.061278e-01
## rho39   4.061278e-01
## rho109  4.090746e-01
## rho114  4.090746e-01
## rho1114 4.320747e-01
## rho611  4.488862e-01
## rho911  4.583531e-01
## rho1    4.628077e-01
## rho108  4.667203e-01
## rho23   5.087570e-01
## rho57   5.087570e-01
## rho140  5.199397e-01
## rho1112 5.241861e-01
## rho120  5.291103e-01
## rho210  5.438178e-01
## rho811  5.546082e-01
## rho50   5.672367e-01
## rho68   5.682740e-01
## rho73   5.802695e-01
## rho20   5.802695e-01
## rho16   5.802695e-01
## rho79   6.114330e-01
## rho56   6.114330e-01
## rho1211 6.812001e-01
## rho214  6.959636e-01
## rho711  6.971660e-01
## rho61   6.971660e-01
## rho74   7.082668e-01
## rho19   7.082668e-01
## rho28   7.082668e-01
## rho511  7.208935e-01
## rho51   7.253048e-01
## rho38   7.325570e-01
## rho59   7.372417e-01
## rho1111 7.548900e-01
## rho58   7.629784e-01
## rho82   7.711965e-01
## rho26   7.889773e-01
## rho116  7.948715e-01
## rho105  7.948715e-01
## rho18   8.038780e-01
## rho71   8.142656e-01
## rho212  8.142656e-01
## rho32   8.142656e-01
## rho72   8.234106e-01
## rho510  8.327840e-01
## rho24   8.395467e-01
## rho78   8.751275e-01
## rho310  8.935819e-01
## rho1113 9.106554e-01
## rho312  9.238812e-01
## rho29   9.251561e-01
## rho111  9.255089e-01
## rho128  9.619102e-01
## rho77   9.750368e-01
## rho62   9.857495e-01
## rho21   9.857495e-01
## rho5    1.000000e+00
## rho2              NA
## rho25             NA
## rho45             NA
## rho55             NA
## rho65             NA
## rho117            NA
## rho211            NA
## rho40             NA
## rho213            NA

Note how higher TNF expression is associated with lower CFU when expressed in T,NK cells, but is associated with higher CFU when expressed in neutrophils. There are several genes with this pattern. Is it driven by NHP in one group or another? Or is it driven by the group effect overall?

Visualize single gene expression by cluster and group

It’s helpful to visualize this pattern for one gene using boxplots by group and cell cluster. It might be interesting to look at the sub-clusters too.

# Extract the top gene and cluster
top_gene <- res_df$gene[1]
top_cluster <- res_df$cluster[1]

# Prepare data for visualization
ss <- prepare_ss(top_gene)

# Ensure Group has the specified ordering
ss <- ss %>%
  mutate(Group = factor(Group, levels = c("IgG", "antiCD4", "Naïve")))

# Visualization: Boxplot with stripplot overlay
ggplot(ss, aes(x = cpm, y = CoarseClustering, color = Group)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(position = position_jitterdodge(jitter.width = 0.2), alpha = 0.7) +
  labs(
    title = paste("Expression of", top_gene, "by Coarse Clustering"),
    x = paste(top_gene, "expression (log2-CPM)"),
    y = "Coarse Clustering"
  ) +
  theme_minimal()
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Visualize correlation of one gene in one cluster with CFU

# Visualization: Scatterplot of cpm vs logCFU for the top cluster
ss_top <- ss %>% filter(CoarseClustering == top_cluster)

ggplot(ss_top, aes(x = cpm, y = logCFU)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = paste("Scatterplot of", top_gene, "expression in", top_cluster),
    x = paste(top_gene, "expression (log2-CPM)"),
    y = "logCFU"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).